AfterWork


Feature Engineering with Python 


Learning Outcomes 


By the end of this topic, you will have achieved the following learning outcomes: 


I can understand the concept of feature engineering to improve model performance.
I can apply feature engineering techniques to solve modeling problems.
I can differentiate between different feature engineering techniques.
I can understand when to apply different feature engineering techniques.


“Predicting the future isn’t magic, it’s artificial intelligence.” ~Dave Waters 


Overview 


Reading 


Feature engineering is the process of selecting, creating and transforming features
so that a machine learning or statistical model can make better use of the data.


For example, when predicting property prices, we would ensure that the features at
hand, e.g. the size of the house, the number of bedrooms and the location, are properly
transformed before being passed as input to the model that will determine the value of
the property.


When building a machine learning pipeline, we normally expect to spend most of our
time on feature engineering and data cleaning.


Importance of Feature Engineering 
We perform feature engineering because we want our features to be in the best shape 
possible for our model. Having better features would mean the following: 


● Better features mean flexibility
    ○ The flexibility of good features would allow us to use less complex models
      that are faster to run, easier to understand and easier to maintain.
● Better features mean simpler models
    ○ With well-engineered features, we can have less than optimal parameters
      (which is usually the case) and still get good results. We do not need to
      work as hard to pick suitable models and the most optimized
      parameters.


While performing feature engineering, we first create a baseline model and
later compare its performance with that of alternative models.
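
A minimal sketch of such a comparison is shown below. It assumes scikit-learn is available and uses synthetic, made-up data in which price depends on house size; `DummyRegressor` stands in for the baseline and `LinearRegression` for one alternative model. The variable names and numbers are illustrative assumptions only.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic, made-up data: price grows linearly with size, plus noise.
rng = np.random.default_rng(0)
size_sqm = rng.uniform(40, 200, size=200)
price = 1500 * size_sqm + rng.normal(0, 20000, size=200)
X = size_sqm.reshape(-1, 1)

# Baseline: always predict the mean price seen during training.
baseline = DummyRegressor(strategy="mean")
# Alternative model to compare against the baseline.
candidate = LinearRegression()

# 5-fold cross-validated R^2 for each model (higher is better).
baseline_r2 = cross_val_score(baseline, X, price, cv=5, scoring="r2").mean()
candidate_r2 = cross_val_score(candidate, X, price, cv=5, scoring="r2").mean()

print(f"baseline R^2:  {baseline_r2:.3f}")
print(f"candidate R^2: {candidate_r2:.3f}")
```

Any model or engineered feature that cannot beat the baseline score is probably not worth its extra complexity.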


Once we settle on the model that works best for our case, we then apply
feature engineering techniques that could substantially improve its accuracy.


Such techniques include encoding categorical features so that the model can make
better use of the information, generating new features to give the model more
information, and selecting features to reduce overfitting and increase prediction speed.
We will see more feature engineering techniques below.


Feature Engineering Techniques 


Feature engineering entails feature understanding, feature improvement, feature 
construction, feature selection, feature transformation and feature learning. 


Feature Understanding 
● During this step, we try to understand the features at hand and why we need to
  take them into consideration in our model. This entails understanding
  structured/unstructured data and qualitative/quantitative data, then looking into
  the four levels of data, i.e. nominal, ordinal, interval and ratio.
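
The four levels of data can be made explicit in code. The sketch below assumes pandas and a small, hypothetical property table; the column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data, one column per level of measurement.
df = pd.DataFrame({
    "location": ["suburb", "city", "rural", "city"],   # nominal: labels, no order
    "condition": ["poor", "good", "fair", "good"],     # ordinal: labels with an order
    "year_built": [1990, 2005, 1978, 2015],            # interval: differences are meaningful
    "size_sqm": [120.0, 75.5, 200.0, 60.0],            # ratio: true zero, ratios are meaningful
})

# Nominal data: an unordered categorical column.
df["location"] = df["location"].astype("category")

# Ordinal data: a categorical column with an explicit order.
condition_levels = pd.CategoricalDtype(["poor", "fair", "good"], ordered=True)
df["condition"] = df["condition"].astype(condition_levels)

print(df.dtypes)
print(df["condition"].min())                          # ordering makes min/max meaningful
print(df["size_sqm"].max() / df["size_sqm"].min())    # ratios only make sense for ratio data
```

Knowing the level of each feature helps decide which encodings and summary statistics are appropriate for it.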


Feature Improvement 
● We ensure that our dataset is easy to understand and work with from the
  outset. This means applying data cleaning methods, e.g. handling
  missing data and handling outliers, and feature scaling methods such as
  standardisation and normalisation.
  Note: Feature scaling is a technique for bringing the independent
  features present in the data onto a common scale or into a fixed range,
  e.g. we might decide to transform one feature to fit a scale of 0 - 1.


    ○ Normalization
        ■ This is a scaling technique in which values are shifted and
          rescaled so that they end up ranging between 0 and 1. We use
          normalization when we know that our data does not follow a
          Gaussian distribution. This can be useful in algorithms that
          do not assume any distribution of the data, like
          K-Nearest Neighbors.


    ○ Standardization
        ■ This is the other scaling technique, where the values are centred
          around the mean with a unit standard deviation. This means that
          the mean of the attribute becomes zero and the resultant
          distribution has a unit standard deviation. It can be helpful in cases
          where the data follows a Gaussian distribution. However, this does
          not necessarily have to be the case. Also, unlike normalization,
          standardization does not have a bounding range.
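
The two scaling techniques above can be sketched with scikit-learn's `MinMaxScaler` and `StandardScaler`. The house sizes are made-up values; the formulas in the comments are the usual min-max and z-score definitions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up house sizes in square metres, as a single feature column.
X = np.array([[50.0], [80.0], [120.0], [200.0]])

# Normalization (min-max): x' = (x - min) / (max - min), giving values in [0, 1].
X_norm = MinMaxScaler().fit_transform(X)

# Standardization (z-score): z = (x - mean) / std, giving mean 0 and unit std.
X_std = StandardScaler().fit_transform(X)

print(X_norm.ravel())
print(X_std.ravel())
```

In practice the scaler is usually fitted on the training split only and then applied to the test split, so that no information from the test data leaks into the model.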


Feature Construction 
This step would involve imputing categorical features, encoding categorical
variables, extending numerical features and performing text-specific feature
construction.
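
As a rough illustration of three of these steps, the sketch below uses pandas on a hypothetical table: it imputes a missing category with the most frequent value, one-hot encodes the categorical column, and extends a numerical feature with a squared term. The column names and the choice of mode imputation are assumptions made for the example, not the only options.

```python
import numpy as np
import pandas as pd

# Hypothetical property data with one missing category.
df = pd.DataFrame({
    "property_type": ["house", "apartment", np.nan, "house", "condo"],
    "size_sqm": [120.0, 75.0, 60.0, 150.0, 90.0],
})

# Impute: fill the missing category with the most frequent value (the mode).
most_common = df["property_type"].mode()[0]
df["property_type"] = df["property_type"].fillna(most_common)

# Encode: turn the categorical column into one 0/1 column per category.
encoded = pd.get_dummies(df, columns=["property_type"], dtype=int)

# Extend a numerical feature: add a squared term as a new column.
encoded["size_sqm_squared"] = encoded["size_sqm"] ** 2

print(encoded)
```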


A common mistake people make when they start predictive modeling is to focus
on the data already available. Instead, they should consider what data is
required. This mistake often leads to the following problems:

    ○ Important predictor variables end up being left out of the model.
          For example, in a model predicting property prices, knowledge of
          the type of property (e.g., house, apartment, condo, retail, office,
          industrial) is crucially important.

    ○ Variables that should be created from available data are not created.
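
As a sketch of creating a variable from available data, the snippet below derives a body mass index (BMI) column, computed as weight in kilograms divided by the square of height in metres. The DataFrame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical measurements: weight in kilograms, height in metres.
people = pd.DataFrame({
    "weight_kg": [60.0, 80.0, 95.0],
    "height_m": [1.65, 1.80, 1.75],
})

# Construct the new feature: BMI = weight / height^2.
people["bmi"] = people["weight_kg"] / people["height_m"] ** 2

print(people.round(1))
```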


          For example, a good predictor of many health outcomes is the
          Body Mass Index (BMI). To calculate BMI, we divide a
          person's weight by the square of their height. To build a good
          predictive model of health outcomes, we need to know enough to
          work out that we need to create this variable as a feature for our
          model. If we just include height and weight in the model, the
          resulting model will likely perform worse than a model that
          includes BMI, height, and weight as predictors, along with other
          relevant variables (e.g., diet, the ratio of waist to hip circumference).


Feature Selection


This process involves deciding which predictor variables should be
included in the model. This becomes even more important when the number of
features is very large. We need not use every feature at our disposal when
building a model.


● There are various reasons to use feature selection, which are as follows:
    ○ To enable the machine learning algorithm to train faster.
    ○ To reduce the complexity of a model and make it easier to interpret.
    ○ To improve the accuracy of a model if the right subset is chosen.
    ○ To reduce overfitting of our models.


Filter Methods 


In filter methods, the selection of features is independent of any machine
learning algorithm. Features are selected on the basis of their scores in various
statistical tests for their correlation with the outcome variable. An example of a
filter method is using Pearson’s correlation coefficient to identify the
most critical features in a dataset.
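
A correlation-based filter of this kind might look like the sketch below. It assumes pandas and NumPy, uses synthetic data in which `size_sqm` and `bedrooms` drive the price while `noise` does not, and picks an arbitrary cutoff of 0.1; all of these names and numbers are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic, made-up data: two informative features and one pure-noise feature.
rng = np.random.default_rng(42)
n = 2000
df = pd.DataFrame({
    "size_sqm": rng.uniform(40, 200, n),
    "bedrooms": rng.integers(1, 6, n),
    "noise": rng.normal(0, 1, n),
})
df["price"] = (1500 * df["size_sqm"]
               + 20000 * df["bedrooms"]
               + rng.normal(0, 15000, n))

# Score each feature by its absolute Pearson correlation with the target.
correlations = df.drop(columns="price").corrwith(df["price"]).abs()
ranked = correlations.sort_values(ascending=False)

# Keep features whose score exceeds an arbitrary, illustrative threshold.
threshold = 0.1
selected = ranked[ranked > threshold].index.tolist()

print(ranked)
print("selected:", selected)
```

Note that Pearson's coefficient only measures linear association, so a feature with a strong non-linear relationship to the target can still receive a low score.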


Feature Transformations 


Feature transformation involves manipulating a predictor variable in some
way so as to improve its performance in the predictive model. A variety of
techniques come into play when transforming features, including
principal component analysis and linear discriminant analysis.

        ■ Principal component analysis (PCA) reduces the
          dimensionality of a dataset by projecting the data from a
          higher-dimensional space onto a lower-dimensional subspace
          that captures the “essence” of the data.

        ■ Linear discriminant analysis (LDA) is a type of analysis used in
          supervised learning that also reduces the dimensionality of
          a dataset by projecting the data from a higher-dimensional space
          onto a lower-dimensional subspace.

        ■ We can use both PCA and LDA when working with similar problems;
          however, in the case of uniformly distributed data, LDA almost
          always performs better than PCA. If the data is highly
          skewed (irregularly distributed), it is advised to use PCA, since
          LDA can be biased towards the majority class.
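
The difference between the two projections can be sketched with scikit-learn on its bundled iris dataset, which is an illustrative choice and not a dataset used in this topic. PCA sees only the features, while LDA also needs the class labels.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Iris: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# PCA is unsupervised: it finds the directions of largest variance in X alone.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# LDA is supervised: it uses the labels y to find directions that separate
# the classes, and can produce at most (number of classes - 1) components.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print(X_pca.shape, X_lda.shape)
print(pca.explained_variance_ratio_)
```

Both calls reduce the four original features to two, but the resulting axes generally differ because the two methods optimise different criteria.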


Wrapper Methods 


In wrapper methods, we use a subset of features and train a model using them.
Based on what we learn from the previous model, we decide to add or remove
features from our subset.

Wrapper methods evaluate different combinations of the features (in the
exhaustive case, all possible combinations) and select the combination that
produces the best result for a specific machine learning algorithm. This helps
to find the best set of features for a specific algorithm, but can be
computationally very expensive. Moreover, this set of features may not be
optimal for every other machine learning algorithm.


● Common wrapper methods include recursive feature elimination, step
  forward feature selection and step backward feature selection.

    ○ The recursive feature elimination (RFE) method works by
      recursively removing attributes and building a model on those
      attributes that remain.

    ○ During step forward feature selection, we start with no
      variables in the model, test the addition of each variable using
      a chosen model fit criterion, add the variable (if any) whose
      inclusion gives the most statistically significant improvement of the
      fit, and repeat this process until no addition improves the model to a
      statistically significant extent.

    ○ Step backward feature selection involves starting with all
      candidate variables, testing the deletion of each variable using a
      chosen model fit criterion, deleting the variable (if any) whose loss
      gives the most statistically insignificant deterioration of the model
      fit, and repeating this process until no further variables can be
      deleted without a statistically significant loss of fit.
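
The three wrapper methods above can be sketched with scikit-learn's `RFE` and `SequentialFeatureSelector` on its bundled diabetes dataset. Two assumptions are worth flagging: the target of four features is arbitrary, and `SequentialFeatureSelector` uses a cross-validated score as its fit criterion rather than a statistical significance test, so it is a close relative of the stepwise procedures described above, not an exact implementation of them.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Bundled regression dataset: 10 numeric features, returned as a DataFrame.
X, y = load_diabetes(return_X_y=True, as_frame=True)
model = LinearRegression()

# Recursive feature elimination: repeatedly fit the model and drop the
# feature with the smallest coefficient magnitude until 4 remain.
rfe = RFE(estimator=model, n_features_to_select=4).fit(X, y)
rfe_features = X.columns[rfe.support_].tolist()

# Step forward selection: start from no features and add the one that
# most improves the cross-validated score, until 4 are selected.
forward = SequentialFeatureSelector(
    model, n_features_to_select=4, direction="forward", cv=5
).fit(X, y)
forward_features = X.columns[forward.get_support()].tolist()

# Step backward selection: start from all features and remove the one
# whose removal hurts the cross-validated score the least, until 4 remain.
backward = SequentialFeatureSelector(
    model, n_features_to_select=4, direction="backward", cv=5
).fit(X, y)
backward_features = X.columns[backward.get_support()].tolist()

print("RFE:     ", rfe_features)
print("forward: ", forward_features)
print("backward:", backward_features)
```

The three subsets need not agree, which illustrates that the selected features depend on the search strategy as well as on the model.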


Feature Learning 

This involves the automatic identification and use of features in raw data, where we use 
learning methods such as autoencoders and Boltzmann machines to automatically 
extract essential features from our dataset. 
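
As one possible illustration, scikit-learn ships a restricted Boltzmann machine, `BernoulliRBM`, that can learn features from raw pixel values without labels. The sketch below uses the bundled digits dataset; the number of hidden units, learning rate and number of iterations are arbitrary values chosen for the example.

```python
from sklearn.datasets import load_digits
from sklearn.neural_network import BernoulliRBM
from sklearn.preprocessing import MinMaxScaler

# 1797 images of handwritten digits, 64 raw pixel values each; labels unused.
X, _ = load_digits(return_X_y=True)

# BernoulliRBM expects inputs in the range [0, 1].
X = MinMaxScaler().fit_transform(X)

# Learn 16 hidden units; each unit becomes one automatically learned feature.
rbm = BernoulliRBM(n_components=16, learning_rate=0.05, n_iter=10, random_state=0)
learned_features = rbm.fit_transform(X)   # hidden-unit activation probabilities

print(learned_features.shape)
```

The learned features could then be passed to a downstream model in place of, or alongside, the raw pixels.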